Abstract. Graphs are commonly used to characterise interactions between
objects of interest. Because they are based on a straightforward
formalism, they are used in many scientific fields from computer science 
to historical sciences. In this paper, we give an introduction to some 
methods relying on graphs for learning. This includes both unsupervised 
and supervised methods. Unsupervised learning algorithms usually aim 
at visualising graphs in latent spaces and/or clustering the nodes. Both 
focus on extracting knowledge from graph topologies. While most existing
techniques are only applicable to static graphs, where edges do not
evolve through time, recent developments have shown that they could be 
extended to deal with evolving networks. In a supervised context, one 
generally aims at inferring labels or numerical values attached to nodes 
using both the graph and, when they are available, node characteristics. 
Balancing the two sources of information can be challenging, especially as 
they can disagree locally or globally. In both contexts, supervised and
unsupervised, data can be relational (augmented with one or several global
graphs) as described above, or graph valued. In this latter case, each 
object of interest is given as a full graph (possibly completed by other 
characteristics). In this context, natural tasks include graph clustering (as 
in producing clusters of graphs rather than clusters of nodes in a single 
graph), graph classification, etc.

1 Real networks 

One of the first practical studies on graphs can be dated back to the original
work of Moreno in the 1930s. Since then, there has been a growing interest
in graph analysis, associated with strong developments in the modelling and
the processing of these data. Graphs are now used in many scientific fields.
In biology, for instance, metabolic networks can describe pathways of
biochemical reactions, while in the social sciences networks are used to represent
relational ties between actors [55]. Other examples include power grids
and the web. Recently, networks have also been considered in other
areas such as geography [22] and history [59], [39]. In machine learning, networks
are seen as powerful tools to model problems in order to extract information
from data and for prediction purposes. This is the object of this paper. For
more complete surveys, we refer to [55], [51].

In this section, we introduce notations and highlight properties shared by 
most real networks. In Section 2, we then consider methods aiming at extracting
information from a unique network. We will particularly focus on clustering
methods where the goal is to find clusters of vertices. Finally, in Section 3,
techniques that take a series of networks into account, where each network is
seen as an object, are investigated. In particular, distances and kernels for graphs
are discussed.

1.1 Notations 

A graph is first characterised by a set $\mathcal{V}$ of $N$ vertices and a set
$\mathcal{E}$ of edges between pairs of vertices. The graph is said to be directed
if the pairs $(i,j)$ in $\mathcal{E}$ are ordered, undirected otherwise. A graph
with self loops is made of vertices which can be connected to themselves. The
degree of a vertex $i$ is the total number of edges connected to $i$, with self
loops counted twice. In most applications, only the presence or absence of an
edge is characterised. However, edges can also be weighted by a function from
$\mathcal{E}$ to $F$, for any set $F$. More generally, arbitrary labelling
functions can be defined on both the vertices and the edges, leading to labelled
graphs.

A graph is usually described by an $N \times N$ adjacency matrix
$X = (X_{ij})_{ij}$ where $X_{ij}$ is the value associated to the edge between
the $(i,j)$ pair. It is equal to zero in the absence of relationship between the
nodes. In the case of binary graphs, the matrix $X$ is binary and $X_{ij} = 1$
indicates that the two vertices are connected. If the graph is undirected then
$X$ is symmetric, that is $X_{ij} = X_{ji}$ for all $(i,j)$.
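
As an illustration of these notations, the following minimal Python sketch builds the adjacency matrix of a small binary graph and computes vertex degrees with self loops counted twice. The helper names (`adjacency_matrix`, `degree`, `is_symmetric`) and the toy edge list are hypothetical and only serve as an example.

```python
def adjacency_matrix(n_vertices, edges, directed=False):
    """Build a binary N x N adjacency matrix from a list of (i, j) pairs."""
    X = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        X[i][j] = 1
        if not directed:
            X[j][i] = 1  # an undirected graph has a symmetric matrix
    return X


def degree(X, i):
    """Degree of vertex i in an undirected graph, self loops counted twice."""
    return sum(X[i]) + X[i][i]


def is_symmetric(X):
    """Check that X[i][j] == X[j][i] for all pairs (i, j)."""
    n = len(X)
    return all(X[i][j] == X[j][i] for i in range(n) for j in range(n))


# Hypothetical toy graph: a triangle 0-1-2, an edge 2-3 and a self loop on 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 3)]
X = adjacency_matrix(4, edges)
degrees = [degree(X, i) for i in range(4)]
```

With ordered pairs (`directed=True`), only `X[i][j]` is set, so the matrix is in general not symmetric.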

We use interchangeably the vocabulary from graph theory introduced above 
and a less formal vocabulary in which a graph is called a network and a vertex
a node. In general, the network is the real world object while the graph is its 
mathematical representation, but we have a more relaxed use of the terms. 

1.2 Properties 

A remarkable characteristic of most real networks is that they share common
properties [20], [67]. First, most of them are sparse, i.e. the number of edges
present is not quadratic in the number of vertices, but linear. Thus, the mean
degree remains bounded when N increases and the network density, defined as
the ratio of the number of existing edges to the number of potential
edges, tends to zero. Second, while some vertices of a real network can have few
connections or no connection at all with the other vertices, most vertices belong
to a single component, the so-called giant component, where it is always possible
to find a path, i.e. a sequence of adjacent edges, connecting any pair of nodes.
Nodes can be disconnected from this component, forming significantly smaller
components. Finally, we would like to highlight the degree heterogeneity and
small world properties. The first property states that few vertices have a lot
of links, while most of the vertices have few connections. Therefore, scale-free
distributions are often considered to model the degrees. The second one
indicates that the shortest path from one vertex to another is generally rather
small, typically of size O(log N).
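
The quantities discussed above are simple to compute on a given network. The sketch below, written under the assumption of a simple undirected graph stored as a dictionary of neighbour sets, computes the density, the mean degree and the largest connected component by breadth-first search. The function names and the toy network are hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph.

    `adj` maps each vertex to the set of its neighbours.
    """
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components


def density(adj):
    """Existing edges over potential edges, for a simple undirected graph."""
    n = len(adj)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    return n_edges / (n * (n - 1) / 2)


def mean_degree(adj):
    """Average number of neighbours per vertex."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


# Hypothetical toy network: a path 0-1-2-3, a pair 4-5 and an isolated vertex 6.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
giant = max(connected_components(adj), key=len)
```

On a real network, one would typically observe that the largest component contains most of the vertices and that the density decreases as the number of vertices grows.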


2 Graph clustering 


In order to extract information from a unique graph, unsupervised methods 
usually look for clusters of vertices sharing similar connection profiles, a particular
case of general vertices clustering [63]. They differ in the way they define the 
topology on top of which clusters are built. 

2.1 Community structure 

Most graph clustering algorithms aim at uncovering specific types of clusters,
so-called communities, where there are more edges between vertices of the same
community than between vertices of different communities. Thus, communities
appear in the form of densely connected clusters of vertices, with sparser
connections between groups. They are characterised by the friend of my friend
is my friend effect, i.e. a transitivity property, also called assortative mixing
effect. Two families of methods for community discovery can be singled out
among a vast set of methods [24], depending on whether they maximize a score
derived from the modularity score of Girvan and Newman [27] or rely on the
latent position cluster model (LPCM) of Handcock, Raftery and Tantrum [34].

2.1.1 Modularity score 

A series of community detection algorithms have been proposed (see for instance
the survey [24]). They involve iterative removal of edges from
the network to detect communities, where candidate edges for removal are chosen
according to betweenness measures. All measures rely on the same idea that two
communities, by definition, are joined by few edges and therefore, all paths
from vertices in one community to vertices in the other are likely to pass along
these few edges. Therefore, the number of paths that go along an edge is expected
to be larger for inter-community edges. For instance, the edge betweenness of an
edge counts the number of shortest paths between all pairs of vertices
that run along that edge. Moreover, the random walk betweenness evaluates
the expected number of times a random walk would pass along the edge, for all
pairs of vertices.
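
To make the notion of edge betweenness concrete, here is a minimal sketch for unweighted, undirected graphs. It relies on a breadth-first accumulation scheme in the style of Brandes, which is an implementation choice of this example and is not discussed in the text. When several shortest paths join a pair of vertices, each path is assumed to count for an equal fraction. The function name and the graph representation (dictionary of neighbour sets) are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph.

    Each unordered pair of vertices contributes one unit in total, split
    equally among its shortest paths.
    """
    score = {}
    for v in adj:
        for w in adj[v]:
            score[frozenset((v, w))] = 0.0
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, farthest vertices first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every pair was visited from both endpoints, hence the division by two.
    return {edge: value / 2.0 for edge, value in score.items()}


# Hypothetical example: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
```

In this example the edge joining the two triangles receives the largest score, so it would be the first candidate for removal in the iterative scheme described above.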

The iterative removal of edges produces a dendrogram, describing a hierarchical
structure, from a situation where each vertex belongs to a different cluster
to the inverse scenario where all vertices are clustered within the same
community. The modularity score [27] is then considered to select a particular
division of the network into K clusters. Denoting $e_{kl}$ the fraction of edges in
the network connecting vertices of communities k and l, as well as
$a_k = \sum_{l=1}^{K} e_{kl}$, the fraction of edges that connect with vertices of
community k, the modularity score is given by:

$$Q_{\mathrm{mod}} = \sum_{k=1}^{K} \left( e_{kk} - a_k^2 \right).$$

Such a criterion is computed for the different levels in the hierarchy and K is
chosen such that $Q_{\mathrm{mod}}$ is maximised.
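
The modularity of a given partition, that is the sum over communities of the within-community edge fraction minus the squared fraction of edges attached to the community, can be computed directly from an edge list. The sketch below is an illustration under one assumed convention: the matrix of edge fractions is kept symmetric, each edge between two distinct communities contributing half of its weight to each of the two corresponding entries. The function name and the data layout are hypothetical.

```python
def modularity(edges, labels):
    """Modularity of a partition of an undirected graph.

    `edges` is a list of (i, j) pairs and `labels[i]` is the community of
    vertex i. Assumed convention: e[k][l] is half the fraction of edges
    joining communities k and l when k != l, so that e is symmetric and
    a[k] is the fraction of edge ends attached to community k.
    """
    communities = sorted(set(labels.values()))
    e = {k: {l: 0.0 for l in communities} for k in communities}
    m = len(edges)
    for i, j in edges:
        k, l = labels[i], labels[j]
        e[k][l] += 0.5 / m
        e[l][k] += 0.5 / m
    a = {k: sum(e[k].values()) for k in communities}
    return sum(e[k][k] - a[k] ** 2 for k in communities)


# Hypothetical example: two triangles joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Here the split into the two triangles obtains a strictly positive score, while the trivial partition with a single community obtains zero, which is why the former would be preferred when the levels of a dendrogram are compared.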

Rather than building the complete dendrogram, other algorithms have focused
on optimising the modularity score directly, as it is beneficial both in
computational terms and in the perceived quality of the obtained partitions. A
very popular algorithm, the so-called Louvain method, proceeds by a series
of greedy exchanges and mergings that turn a fully refined partition into
a coarser one that provides a (local) maximum of the modularity. Better
solutions can be obtained using more sophisticated heuristics, but maximising
the modularity is an NP-hard problem [12].

Note that modularity approaches have been shown to be asymptotically
biased [5]. To tackle this issue, degree corrected methods were introduced in order
to take the degrees of nodes into account.

2.1.2 Latent position cluster model

Alternative approaches, looking for clusters of vertices with assortative mixing,
usually rely on the LPCM model [53], which is a generalisation of the latent
position model (LPM) [55]. In the original LPM model, each vertex i is first
assumed to be associated with a position $Z_i$ in a Euclidean latent space
$\mathbb{R}^d$. Each edge between a pair $(i,j)$ of vertices is then drawn
depending on $Z_i$ and $Z_j$.
Both maximum likelihood and Markov chain Monte Carlo (MCMC) techniques
were considered to estimate the model parameters and the latent positions. The
corresponding mapping of the vertices into the Euclidean latent space produces
a representation of the network such that nodes which are likely to be connected
have similar positions. Note that if the latent space is low dimensional, typically
of dimension d = 1, 2, 3, then the representation can be visualised, which is
a feature appreciated by practitioners.

The LPM model was extended in order to look for both a representation of 
the network and a clustering of the vertices. Thus, the corresponding LPCM 
model assumes that the positions are drawn from a Gaussian mixture model in 
the latent space such that each Gaussian distribution corresponds to a cluster. A 
two-stage maximum likelihood approach along with a Bayesian MCMC scheme
were proposed for inference purposes. Moreover, conditional Bayes factors were
considered to estimate the number of clusters from the data. Finally, variational
Bayesian inference is also possible.
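
The generative side of the latent position cluster model can be sketched in a few lines. The code below is only an illustration under assumptions that go beyond the text: latent positions are drawn from a mixture of spherical Gaussians, and an edge between two vertices appears with a probability given by a logistic function of an intercept `alpha` minus the Euclidean distance between their positions. The function name and all parameter values are hypothetical, and no inference is performed.

```python
import math
import random


def sample_lpcm(n, means, sigma, weights, alpha, rng):
    """Draw cluster labels, latent positions and an undirected binary graph."""
    d = len(means[0])
    # Cluster of each vertex, drawn according to the mixture weights.
    labels = rng.choices(range(len(means)), weights=weights, k=n)
    # Latent positions, drawn from the Gaussian of the assigned cluster.
    Z = [[rng.gauss(means[g][c], sigma) for c in range(d)] for g in labels]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed link: logit P(X_ij = 1) = alpha - ||Z_i - Z_j||.
            eta = alpha - math.dist(Z[i], Z[j])
            p = 1.0 / (1.0 + math.exp(-eta))
            if rng.random() < p:
                X[i][j] = X[j][i] = 1
    return labels, Z, X


# Hypothetical setting: two well separated clusters in a 2-dimensional space.
rng = random.Random(0)
labels, Z, X = sample_lpcm(
    n=20, means=[[0.0, 0.0], [5.0, 5.0]], sigma=1.0,
    weights=[0.5, 0.5], alpha=1.0, rng=rng,
)
```

Because the connection probability decreases with the distance, vertices drawn from the same Gaussian component tend to be connected more often, which is how the model produces communities.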

2.2 Heterogeneous structures 

So far, we have discussed methods looking exclusively for communities in
networks. Other approaches usually derive from the stochastic block model (SBM)
of Nowicki and Snijders [56]. They can also look for communities, but are not
restricted to them.

The SBM model assumes that nodes are spread in unknown clusters and
that the probability of a connection between two nodes i and j depends on
their corresponding clusters. In practice, a latent vector $Z_i$ is drawn from a
multinomial distribution with parameters $(1, \alpha = (\alpha_1, \dots, \alpha_K))$,
where $\alpha_k$ is the proportion of cluster k. Therefore, $Z_i$ is a binary vector
of size K with a single 1, such that $Z_{ik} = 1$ indicates that i belongs to
cluster k, 0 otherwise. If i is in cluster k and j in l, then the SBM model assumes
that there is a probability $\pi_{kl}$ of a connection between the two nodes. All
connection probabilities are characterised by a $K \times K$ matrix $\Pi$. Note that
a community structure can be defined by setting the diagonal terms of $\Pi$ to
higher values than the off-diagonal terms. In practice, because no assumptions are
made regarding $\Pi$, the SBM model can take heterogeneous structures into
account [18].
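
The sampling scheme of the SBM can be sketched as follows. This is a minimal illustration that assumes an undirected binary graph without self loops and a symmetric matrix of connection probabilities. The function name `sample_sbm` and the parameter values are hypothetical.

```python
import random


def sample_sbm(n, alpha, Pi, rng):
    """Sample latent vectors and an undirected binary graph from an SBM.

    `alpha` holds the cluster proportions and `Pi[k][l]` is the probability
    of an edge between a vertex of cluster k and a vertex of cluster l.
    """
    K = len(alpha)
    clusters = rng.choices(range(K), weights=alpha, k=n)
    # One-hot encoding of the clusters, i.e. the latent vectors Z_i.
    Z = [[1 if k == c else 0 for k in range(K)] for c in clusters]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < Pi[clusters[i]][clusters[j]]:
                X[i][j] = X[j][i] = 1
    return Z, X


# Hypothetical community structure: larger diagonal than off-diagonal terms.
rng = random.Random(1)
Z, X = sample_sbm(30, alpha=[0.6, 0.4], Pi=[[0.8, 0.05], [0.05, 0.7]], rng=rng)
```

Other choices of the connection matrix, for instance small diagonal terms and large off-diagonal ones, would produce structures that are not communities, which illustrates the flexibility mentioned above.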

While generating a network with such a sampling scheme is straightforward,
estimating the model parameters $\alpha$ and $\Pi$ as well as the set $(Z_i)_i$ of
all latent vectors is challenging. One of the key issues is that the posterior
distribution of Z given the adjacency matrix X and the model parameters
$(\alpha, \Pi)$ cannot be factorised due to conditional dependency. Therefore,
standard optimisation algorithms, such as the expectation maximisation (EM)
algorithm, cannot be derived. To tackle this issue, variational and stochastic
approximations have been proposed, in particular a variational EM (VEM)
algorithm and a variational Bayes EM (VBEM) approach. Alternatively, the
posterior distribution of the model parameters and Z, given X, was estimated
in [56] by considering Gibbs sampling.

An even more fundamental question concerns the estimation of the number of
clusters present in the data. Unfortunately, since the likelihood is not tractable
either, standard model selection criteria, like the Akaike information criterion
(AIC) or the Bayesian information criterion (BIC), cannot be computed. Again,
variational approximations along with asymptotic Laplace approximations were
derived to obtain approximate model selection criteria [44].

In some cases, the clustering of the nodes and the estimation of the number
of clusters are performed at the same time using an allocation sampler, a greedy
search, or non-parametric schemes.

2.3 Extensions 

Since the original development of the SBM model, many extensions have been
proposed to deal for instance with valued edges [48] or to take into account
covariate information [49]. The random subgraph model (RSM) [39], for
instance, assumes that a partition of the nodes into subgraphs is observed and
that the subgraphs are made of (unknown) latent clusters, as in the SBM model,
with various mixing proportions. The edges are typed. In parallel, strategies
looking for overlapping clusters, where each node can belong to multiple clusters,
have been derived. In [1], a vertex i belongs to a cluster in its relation with a
given vertex j. Because i is involved in multiple relations in the network, it can
belong to more than one cluster. In [43], the multinomial distribution of the
SBM model is replaced with a product of Bernoulli distributions, allowing each
vertex to belong to no, one, or several clusters.

In the last few years, a lot of attention has been paid to extending the
approaches mentioned previously in order to deal with dynamic networks where
nodes and/or edges can evolve through time. The main idea consists in
introducing temporal processes, such as hidden Markov models (HMM) or linear
dynamic systems. While models usually focus on modelling the
dynamics of networks through the evolution of their latent structures, Heaukulani
and Ghahramani [35] chose to define how observed social interactions can affect
future unobserved latent structures. We would also like to highlight the work
of Dubois, Butts, and Smyth. Contrary to most dynamic clustering
approaches, they considered a non-homogeneous Poisson process, which makes it
possible to deal with continuous time periods where events, i.e. the creation or
removal of an edge, can occur one at a time. Another approach for graph clustering
in the continuous time context is provided by [29], which builds a co-clustering
structure on the vertices of the graph and on the time stamps of the edges.

3 Multiple graphs 

While a large part of the graph related literature in machine learning targets the 
case of a single graph, numerous applications lead naturally to data sets made 
of graphs, that is situations in which each data point is a graph (or consists in 
several components including at least one graph). This is the case for instance 
in chemistry where molecules can be represented by undirected labelled graphs
and in biology where the structure of a protein can be represented
by a graph that encodes neighborhoods between its fragments. In fact,
the use of graphs as structured representations of complex data follows a long 
tradition with early examples appearing in the late seventies [60] and with a 
tendency to become pervasive in the last decade. 

It should be noted that even in the case of a single global graph described in 
the first part of this paper, it is quite natural to study multiple graphs derived 
from the global one, in particular via the ego-centered approach which is very 
common in social sciences (see e.g. [26]). The main idea is to extract from 
a large social network a set of small networks centered on each of the vertices 
under study. For real world social networks, it is in general the only possible 
course of action, the whole network being impossible to observe (see e.g. [19] for 
an example). 
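
The ego-centered approach can be illustrated by a short sketch. We assume here that the ego network of a vertex is the subgraph induced by this vertex and its direct neighbours, which is one common choice among others, since larger neighbourhoods are also used in practice. The function name and the toy network are hypothetical.

```python
def ego_network(adj, ego):
    """Subgraph induced by `ego` and its direct neighbours.

    `adj` maps each vertex to the set of its neighbours.
    """
    keep = {ego} | adj[ego]
    return {v: adj[v] & keep for v in keep}


# Hypothetical global network: a triangle 0-1-2 followed by a path 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
# One small graph per vertex under study: a data set made of graphs.
ego_graphs = {v: ego_network(adj, v) for v in adj}
```

Each vertex of the global network thus gives rise to one graph, and the resulting collection can be processed with the methods for multiple graphs discussed in this section.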

When dealing with multiple graphs, one tackles the traditional tasks of machine
learning, from unsupervised problems (clustering, frequent patterns analysis,
etc.) to supervised ones (classification, regression, etc.). There are two
main tendencies in the literature: the design of specialized methods obtained by 
adapting classical ones to graphs and the use of distances and kernels coupled 
with generic methods. 

3.1 Specialized methods 

As graphs are not vector data, classical machine learning techniques do not
apply directly. Numerous methods have been adapted in rather specific ways
to handle graphs and other non vector data, especially in the neural network
community, for instance via recursive neural networks as in [33], [30]. In
those approaches, each graph is processed vertex by vertex, by leveraging the
structure to build a form of abstract time. The recursive model maintains an
implicit knowledge of the vertices already processed by means of its state space
neurons.

3.2 Distances and kernels 

A somewhat more generic solution consists in building distances (or
dissimilarities) between graphs and then in using distance-based methods (such as
methods based on the so-called relational approach). One difficulty is that
graph isomorphism should be accounted for when two graphs are compared: two
graphs are isomorphic if they are equal up to a relabeling of their vertices. Any
sound dissimilarity/distance between graphs should detect isomorphic graphs.
This is however far more complex than expected [23], up to a point that the
actual complexity class of the graph isomorphism problem remains unknown (it
belongs to the NP class but is not known to be NP-complete, for instance). While
exact algorithms appear fast on real world graphs, their worst case complexities
are exponential, with a best bound in $O(2^{\sqrt{n \log n}})$ [47]. In addition,
subgraph isomorphism, i.e. determining whether a given graph contains a subgraph
that is isomorphic to another graph, is NP-complete.
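
The definition of isomorphism translates directly into a brute-force test that tries every relabeling of the vertices. The sketch below is only meant to make the definition concrete: its cost grows factorially with the number of vertices, so it is usable on very small graphs only and does not reflect the algorithms cited in the text. The function name is hypothetical.

```python
from itertools import permutations


def are_isomorphic(X, Y):
    """Brute-force isomorphism test between two adjacency matrices."""
    n = len(X)
    if n != len(Y):
        return False
    # Cheap necessary condition: identical sorted row sums.
    if sorted(map(sum, X)) != sorted(map(sum, Y)):
        return False
    # Try every relabeling perm of the vertices of X.
    for perm in permutations(range(n)):
        if all(X[i][j] == Y[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            return True
    return False


# Hypothetical examples: two labelings of a path on 3 vertices, and a triangle.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # 0 - 1 - 2
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # 1 - 0 - 2
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
same = are_isomorphic(path_a, path_b)
different = are_isomorphic(path_a, triangle)
```

A sound distance between graphs should give a zero value to the first pair, whose matrices differ only by the labeling, and a strictly positive value to the second.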

Nevertheless, numerous exact or approximate algorithms have been defined to
try and solve the (sub)graph isomorphism problem (see e.g. [2]). In particular,
it has been shown that those problems (and related ones) are special cases of
the computation of the graph edit distance, a generalization of the string
edit distance. The graph edit distance is defined by first introducing edit
operations on graphs, such as insertion, deletion and substitution of edges and
vertices (labels and weights included). Each operation is assigned a numeric
cost. The total cost of a series of operations is simply the sum of the individual
costs. Then the graph edit distance between two graphs is the cost of the least
costly sequence of operations that transforms one of the graphs into the other
one.

A rather different line of research has provided a set of tools to compare
graphs by means of kernels (as in reproducing kernels). Those symmetric
positive definite similarity functions allow one to generalize any classical vector
space method to non vector data in a straightforward way [64]. Numerous
such kernels have been defined to compare two graphs. Most of them are
based on random walks that take place on the product of the two graphs under
comparison.
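
A simple member of this family can be sketched as follows. We assume unlabelled graphs given by binary adjacency matrices, and we count the walks of the direct product graph up to a maximal length, weighting walks of length t by a decay factor raised to the power t. The truncation, the decay parameter and the function name are choices made for this example and are not taken from the cited works.

```python
def random_walk_kernel(X1, X2, decay=0.1, max_length=4):
    """Truncated random walk kernel between two unlabelled graphs."""
    n1, n2 = len(X1), len(X2)
    pairs = [(u, v) for u in range(n1) for v in range(n2)]
    # Neighbours in the direct product graph: both coordinates follow an edge.
    nbrs = {
        (u, v): [(a, b) for a in range(n1) if X1[u][a]
                 for b in range(n2) if X2[v][b]]
        for (u, v) in pairs
    }
    # walks[p] = number of walks of the current length that end at pair p.
    walks = {p: 1.0 for p in pairs}
    value = float(len(pairs))  # walks of length 0
    weight = 1.0
    for _ in range(max_length):
        new_walks = {p: 0.0 for p in pairs}
        for p in pairs:
            for q in nbrs[p]:
                new_walks[q] += walks[p]
        walks = new_walks
        weight *= decay
        value += weight * sum(walks.values())
    return value


# Hypothetical examples: a single edge, a path on 3 vertices and a triangle.
edge = [[0, 1], [1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
k_edge = random_walk_kernel(edge, edge, decay=0.5, max_length=2)
```

A walk in the product graph corresponds to a pair of walks of the same length, one in each graph, so the kernel value is large when the two graphs share many walks.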

Once a kernel or a distance has been chosen, one can apply any of the kernel
methods [65] or of the relational methods, which gives access to support
vector machines, kernel ridge regression and kernel k-means, to cite only a few.
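
Kernels and distances are closely related: a positive semi-definite kernel induces a distance through the identity d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y), so that a Gram matrix computed on a set of graphs can also feed distance-based methods. The helper below is a minimal sketch of this conversion. Its name is hypothetical and the Gram matrix is assumed to be symmetric and positive semi-definite.

```python
import math


def kernel_distance_matrix(K):
    """Pairwise distances induced by a symmetric, PSD Gram matrix K."""
    n = len(K)
    # max(..., 0.0) guards against tiny negative values due to rounding.
    return [[math.sqrt(max(K[i][i] + K[j][j] - 2.0 * K[i][j], 0.0))
             for j in range(n)] for i in range(n)]


# Hypothetical Gram matrix: linear kernel between the real numbers 1 and 4.
K = [[1.0, 4.0], [4.0, 16.0]]
D = kernel_distance_matrix(K)
```

In this toy case the induced distance is the usual distance between the two numbers.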

4 Conclusion 

This paper has only scratched the surface of the vast literature about graphs in
machine learning. Entire areas of graph applications in machine learning were
ignored.

For instance, it is well known that extracting a neighborhood graph from a
classical vector data set is an efficient way to get insights on the topology of
the data set. This has led to numerous interesting applications ranging from
visualization (as in isomap [68] and its successors) to semi-supervised learning
[8], going through spectral clustering [70] and exploratory analysis of labelled 
data sets [5]. 

Another interesting area concerns the so-called relational data framework 
when a classical data set is augmented with a graph structure: the vertices of 
the graph are elements of a standard vector space and are thus traditional data 
points, but they are interconnected via a graph structure (or several ones in 
complex settings). The challenge consists here in taking into account the graph 
structure while processing the classical data or vice-versa in taking into account 
the data point descriptions when processing the graph. Among other issues, 
those two different sources of information can be contradictory for a given task. 
A typical application of such a framework consists in annotating nodes on social 
media [38] . 

While we have presented some temporal extensions of classical graph related
problems, we have ignored most of them. For instance, the issue of information
propagation on graphs has received a lot of attention [58]. Among other
tasks, machine learning can be used e.g. to predict the probability of passing
information from one actor to another.

More generally, the massive spread in the last decade of online social
networking has the obvious consequence of generating very large relational data sets.
While non vector data have been studied for quite a long time, those new data 
sets push the complexity one step further by mixing several types of non vector 
data. Objects under study are now described by complex mixed data (texts, 
images, etc.) and are related by several networks (friendship, online discussion, 
etc.). In addition, the temporal dynamic of those data cannot be easily ignored 
or summarized. It seems therefore that the next set of problems faced by the 
machine learning community will include graphs in numerous forms, including 
dynamic ones. 